Genomic data resources of the Brain Somatic Mosaicism Network for neuropsychiatric diseases

Somatic mosaicism is defined as an occurrence of two or more populations of cells having genomic sequences differing at given loci in an individual who is derived from a single zygote. It is a characteristic of multicellular organisms that plays a crucial role in normal development and disease. To study the nature and extent of somatic mosaicism in autism spectrum disorder, bipolar disorder, focal cortical dysplasia, schizophrenia, and Tourette syndrome, a multi-institutional consortium called the Brain Somatic Mosaicism Network (BSMN) was formed through the National Institute of Mental Health (NIMH). In addition to genomic data of affected and neurotypical brains, the BSMN also developed and validated a best practices somatic single nucleotide variant calling workflow through the analysis of reference brain tissue. These resources, which include >400 terabytes of data from 1087 subjects, are now available to the research community via the NIMH Data Archive (NDA) and are described here.

Ultimately, seven of twelve MDA reactions that passed the quality control checks were sequenced to 30X coverage.Their DNA libraries were produced using the Illumina TruSeq DNA PCR-Free kit and sequencing was performed on an Illumina HiSeqX platform.The remaining five wells underwent size selection purification with the S1 Maker 10 bp High Pass Protocol for high molecular weight fragments greater than 10 kb.These libraries were PCR amplified, diluted, and then sequenced on an Illumina HiSeqX sequencer 8 .
In a different approach for validating methods of calling mosaic variation, known proportions of genomic DNA derived from the unrelated grandparents of CEPH/Utah pedigree 1463, a family containing the well-characterized subject NA12878, were mixed and sequenced.Phenol chloroform extraction was performed on pedigree 1463 lymphoblastoid cell lines GM12889, GM12891, GM12890, and GM12892 and aliquots of the resulting DNA were mixed in three different proportions, with one sample being processed without mixing (Table 2).DNA libraries were produced using the Illumina PCR-free TruSeq DNA Library Prep kit for 1 µg of mixed DNA, and library quantification was checked using qPCR and the KAPA Library Quantification kit before WGS was conducted on an Illumina HiSeq X instrument 8 .Alignment files were produced using the uniform processing pipeline described in the Data Records section.

Somatic mosaicism reference data -NDA collection 2968.
A unique dataset for the study of somatic mosaicism is available in NDA collection 2968.Twenty-five tissue samples from the brain, heart, liver, and kidneys were obtained from a neurotypical 70-year-old Caucasian female (UCSD-19-110) within a 24-hr postmortem interval.Samples from case number UCSD-19-110 were obtained through the UC San Diego Anatomical Material Program.Samples obtained from human cadavers are exempt from oversight via a state regulated IRB.Approval and study oversight were instead performed by the UC San Diego Anatomical Materials Committee (approval no.106135).Note that the phenotype of this subject is marked as global geriatric decline (GGD).The prefrontal, frontal, parietal, occipital, and temporal lobes of both hemispheres of the brain were dissected, and thirteen 8 mm diameter 1 cm thick punches were collected.Two punches also were collected from the heart, liver, cerebellum, and each kidney.
Tissue punches of the cortical lobes were separated into "Sml" and "Lrg" categories.Each "Sml" sample was taken from a centralized punch of its respective lobe.The remaining punches of each site were homogenized followed by nuclear extraction and were marked as "Lrg." Note that these categories are included in their related file names in the data repository.
Each category of tissue underwent a series of processing steps.The Lrg punches were used for nuclear preparations, where they were homogenized by grinding at −196 °C, motorized homogenization (1% formaldehyde in PBS), rocking at room temperature for 10 min, and then the addition of 0.125 M glycine with another 5 min of rocking.After centrifugation, the samples were kept at 0 °C, washed twice (10 mM Tris-HCl pH 8.0, 1 mM EDTA, 5 mM MgCl 2 , 0.1 M sucrose, 0.5% Triton X-100), centrifuged (1100RCF, 5 min, 4 °C), resuspended in 5 ml washing buffer, and then dounced five times.Homogenates were kept at 0 °C for 30 min before being dounced twenty times.Particulate matter was removed (70 µm sieve opening), and further purification was performed with a sucrose cushion (1.2 M sucrose, 1 M Tris-HCl pH 8.0, 1 mM MgCl 2 , 0.1 M DTT; 3200RCF, 30 min).The resulting nuclear fractions were stored at −80 °C.
These nuclear fraction homogenates (Lrg), homogenized central tissue punch of lobar tissue (Sml), and punches of the cerebellum, left and right kidneys, liver, and heart were used for high depth (300X) WGS.DNA extraction and purification were performed using the AllPrep DNA/RNA Mini kit (Qiagen) according to protocol for −80 °C stabilized tissue.Library preparation was performed using the KAPA HyperPrep PCR-Free Library Prep kit.The Agilent DNA High Sensitivity NGS Fragment Analysis and KAPA Library Quantification kits were used for quality control and quantification.Sequencing was performed on an Illumina NovaSeq.6000  1. Summary of the assays applied to tissues of the neurotypical somatic mosaicism reference brain in C2458.Four assays were applied to various tissues and cells (WES = whole exome sequencing, WGS = whole genome sequencing, LR = linked read, sc = single cell sequencing).Tissues included uniformly pulverized dorsolateral prefrontal cortex (uDLPFC), frozen chunk of dorsolateral prefrontal cortex (cDLPFC), fibroblasts cultured from the dura mater, NeuN+ neurons, NeuN− cell fraction, the cerebellum, and the dura mater.
platform.BAM alignment files were outputted by the Illumina DRAGEN Bio-IT platform.They were converted to fastq files (PICARD v2.20.7) to be aligned to GRCh37d5 with bwa mem (v0.7.17) with parameters set to use soft clipping for supplementary alignments [15][16][17] .The alignment files were sorted before indel regions were realigned with GATK (v3.8-1) and recalibrated according to best practices 18 .Coverage for the WGS samples, marked under UCSD-19-110, is shown in Supplementary Fig. 1).HaplotypeCaller was used to find germline variants 14,19 .Nuclear isolation buffer (0.25 M sucrose, 25 mM KCl, 5 mM MgCl 2 , 10 mM Tris pH 7.5, 100 mM DTT, 0.1% Triton X-100; 0 °C) and douncing were used to produce a homogenate of tissue from the left temporal lobe for single nuclei sorting and MPAS.The resulting sample was rocked and washed similarly to the nuclear extraction described for the Lrg samples, and it was stained using a 30-min incubation with Millipore Sigma NeuN Alexa Fluor 488 with a dilution factor of 1:2500.DAPI staining was performed with 0.5 µg/mL, and the nuclei were sorted on a Becton-Dickinson BD InFlux Cytometer into a 96-well plate.Whole genome amplification was performed with a REPLI-g Single Cell kit 14 .
In addition to WGS, the nuclear fractions were also sorted for cell origin (neurons, oligodendrocytes, astrocytes, and microglia) for massive parallel amplicon sequencing (7000X) targeting variant sites.This protocol involved washing the pellets twice (HBSS without Mg 2+ and Ca 2+ , 5% BSA, 1 mM EDTA) for antibody staining and FANS.Incubation occurred overnight at 4 °C with three unconjugated antibodies (Abcam #ab31940 1:1000 Table 2. Overview of the samples with mixed cell lines.Genomic DNA extracted from lymphoblastoid cell lines of four unrelated individuals was mixed in different proportions before 1 microgram of the resulting mixture was processed and sequenced.For example, Mix 1 is comprised of a 1:2:4:18 ratio.These data are available in C2458.MPAS was performed on the single nuclei temporal cortex samples, the sorted cell fractions, the Sml homogenates, and the Lrg nuclear fractions.AmpliSeq software was used to design primers for the AmpliSeq Custom DNA Panel.The AmpliSeq Library PLUS kit was used to produce DNA libraries.Indexing was performed with AmpliSeq CD Indexes.Sequencing was completed with an Illumina NovaSeq.6000.The fastq files were aligned to GRCh37d5 with bwa mem (v0.7.17), and recalibration was performed with GATK best practices (v3.8-1) [14][15][16][17][18]    and Mental Hygiene IRB no.12-14; Western IRB no.20111080) following provision of informed consent.The secondary use of these tissues for genomic studies was approved by the Boston Children's Hospital Institutional Review Board (protocol S07-02-0087).
The majority of the data of collection 2962 is composed of high depth whole genome sequencing (≥210X coverage) of either the DLPFC or the prefrontal cortex tissue of 59 ASD-affected and 15 unaffected subjects.Additional tissues that underwent WGS include two occipital lobe samples of controls and the saliva of parents of two affected individuals.DNA extraction was performed with the QIAamp DNA Mini kit lysis buffer, phenol chloroform extraction, and isopropanol workup.
One of two methods was applied to obtain high depth WGS data.The first method involved sequencing of seven libraries per sample produced by the Illumina TruSeq DNA PCR-Free library prep kit on an Illumina HiSeqX Ten platform to yield a combined coverage of ≥210X.The second method used sequencing of a single library produced with the Illumina TruSeq Nano DNA library kit on an Illumina HiSeqX Ten for 200X coverage in one run (Supp.Fig. 1) 5 .
The data were aligned to GRCh37d5 using bwa mem 0.7.8 [15][16][17] .Variant calling was performed with Mutect-Panel-of-Normals according to GATK 3.5 best practices 18,20 .Point mutation variants identified within segmental duplications and non-diploid regions were filtered from the results.The Genome Aggregation Database was used to set a threshold for filtering variants at a maximum population minor allele frequency above 1 × 10 −5 21 Population polymorphisms were also removed 22 .Variants outside a variant allele frequency between 0.02 to 0.40 were also filtered.MosaicForecast was used for phasing and identifying postzygotic variants 9 .
Autism spectrum disorder -NDA collection 2960.Brain tissues of the DLPFC, cerebellum, inferior temporal cortex, primary visual cortex, and striatum were extracted from 40 individuals (n = 14 ASD-affected, n = 5 ASD suspected, and n = 21 unaffected controls) to undergo RNA sequencing.These samples were obtained from the NIH NeuroBioBank, the University of Maryland Brain and Tissue Bank, and Yale University, using the consent procedures of each respective institution.All samples were obtained postmortem.Study approval was provided through the Yale University Human Research Protection Program (IRB no.0605001466).Tissue dissection procedures are described in Kang et al., 2011 23 .The total RNA was isolated from the samples using the mirVana kit (Ambion).Deviations from the manufacturer's protocols included each tissue sample being pulverized with liquid nitrogen in a pre-chilled mortar and pestle prior to being transferred to a chilled safe-lock microcentrifuge tube.An equal mass of chilled stainless steel beads (Next Advance, catalog #SSB14B) relative to the mass of the tissue and an equivalent volume of lysis/binding buffer were added.The tissue was homogenized for one minute in a Bullet Blender (Next Advance) before being incubated at 37 °C for one minute.Another nine volumes of the lysis/binding buffer were added and were homogenized for one minute before incubation at 37 °C for two minutes.The miRNA Homogenate Additive was added at one-tenth the volume of the lysate, and extraction proceeded according to the manufacturer's protocol.The RNA samples were then treated with DNase using the TURBO DNA-free kit (Ambion/Life Technologies) and the RNA integrity was measured using Agilent 2200 TapeStation System.
Barcoded libraries were produced with 5 ng of RNA using TruSeq Stranded Total RNA HT Sample Prep Kit with Ribo-Zero Gold kit according to the manufacturer's protocol (Illumina).Paired-end sequencing was performed on Illumina HiSeq.4000 platforms (100 bp × 2) at the Yale Center for Genome Analysis.4) samples from Sheppard Pratt (n = 43 bipolar-depressed, n = 73 bipolar-mania, n = 15 bipolar marked neither as mania nor depression, n = 8 major depression, n = 124 controls).Samples obtained from each biorepository obtained consent according to the tissue banks' internal procedures.Brain and blood samples were studied at the Kennedy Krieger Institute with the approval of the Johns Hopkins University IRB no.NA_00001324.
Samples from a set of 30 pedigrees were obtained from the NRGR (https://www.nimhgenetics.org/).Each family contained a proband under the age of 18 as well as a trio of mother, father, and child.Genomic DNA was derived from either lymphoblastoid cell lines (n = 194) or whole blood (n = 6).These samples were genotyped on an Illumina Infinium SNP array (GSA MG v2, 766,221 SNPs; Macrogen, Inc.).After quality control was completed, n = 199 samples were analyzed.Genotyping results were summarized in.ped and.map files and further quality control was performed with SNPduo and PLINK software 24,25 .Whole exome sequencing was performed on an Illumina HiSeq 2500 platform (Macrogen, Inc.) at 200X average depth of coverage.After further quality control was performed with FastQC, germline and mosaic variant calling were performed with a Sentieon workflow on Google Cloud Platform 26 .
Strict filters were applied to the SMRI data.These included the removal of population polymorphisms with allele frequencies greater than 0.001 in gnomAD (2.1.0) 21.Any somatic SNV calls within repetitive or low complexity intervals were removed, as were any calls with less than five supporting reads.If any SNV calls were within 500 bp of germline calls or were near one another by 1,000 bp they were also filtered.Variants outside the 1000 Genomes Phase 3 mappability mask were removed 22 .Variants were also removed if the alternate allele was supported in the Platinum Genomes 200X WGS NA12878 data 31 .
The Sheppard Pratt cohort consists of n = 43 bipolar-depressed, n = 73 bipolar-mania, n = 15 bipolar marked as neither mania nor depression, and n = 8 major depression affected individuals, as well as n = 124 subjects unaffected by bipolar disorder or depression.Blood samples were used to establish lymphoblastoid cell lines before genomic DNA was purified using a DNeasy Blood and Tissue kit.Target enrichment and DNA library preparation were performed with the Illumina Nextera library kit followed by hybrid capture using Illumina rapid capture enrichment of the 37 Mb target.Whole exome sequencing was conducted at the Broad Institute on Illumina HiSeqX instruments.Sequencing was run until hybrid selection libraries met or exceeded 85% of targets at 20X with 150 bp paired-end reads, comparable to approximately 55X mean coverage (Supp.Fig. 2).
Data delivery per sample included demultiplexing and aggregation into a BAM file, with reads aligned using bwa to the human genome build 38 (GRCh38), which was processed through a pipeline based on the PICARD suite of software tools 15,16 .Single nucleotide polymorphism and insertions/deletions were jointly called across all samples using the GATK HaplotypeCaller package (4.0.10) to produce a version 4.2 variant callset file (VCF) 19 .Variant call accuracy was estimated using the GATK Variant Quality Score Recalibration (VQSR) approach 18,19 .
For sequencing with 10X Genomics linked read technology, samples from n = 30 postmortem individuals were obtained from the LIBD, consisting of individuals diagnosed with BP (n = 20) or neurotypical controls (n = 10).Samples were matched for sex (n = 7/20 females with BP, n = 4/10 female controls), race (3/20 African Americans in bipolar cohort), age at death (average 44.1 years for bipolar cohort), and postmortem interval (range 9 to 54.5 hours, mean 29.6 hours for bipolar cohort).The cause of death was suicide in n = 12 BP cases but no control cases.The DLPFC was dissected to obtain approximately 100 mg per sample.High molecular weight genomic DNA was obtained (Qiagen MagAttract protocol), the integrity was confirmed (TapeStation), and libraries were created.These samples were sequenced (Macrogen Clinical Laboratories, Inc.) to 60X average depth of coverage.Data analysis was performed using Long Ranger software (10X Genomics).One sample (Br1470) failed quality control yielding data from n = 19 bipolar disorder cases.Mosaic variants were assessed using Samovar using custom filtering scripts 32 .
Focal cortical dysplasia -NDA collection 2968.Focal cortical dysplasia (FCD) samples were obtained with approval by UC San Diego IRB no.140028 and processed using whole exome sequencing.Informed consent was obtained from all participants or their legal guardians at the time of enrollment.Diagnostic criteria for the FCD samples included detection of malformations of cerebral cortical development due to altered neuronal proliferation or migration via neuroimaging or an abnormal electroencephalogram (EEG) from focal seizures without cortical malformations.After surgery, a pathological examination was performed on the specimen to determine the FCD subtype following established criteria 33 .
Biopsies of brain samples were predetermined by preoperative scans and checked during the surgical procedure with the aid of a neuronavigation system.The biopsied tissue was then fractionated into two parts; one was sent for anatomopathological study and another for genetic analysis.A venous blood sample, saliva sample, or both were also collected during surgery.
Brain tissue, peripheral blood, and saliva-derived DNA samples underwent exome capture.To this end, DNA was extracted using a QIAamp DNA Mini kit (Qiagen) or DNA Purification kit (Promega).Libraries were prepared from 250 ng of genomic DNA, and exonic regions were captured by using the Illumina Nextera Rapid Capture Exome kit or Agilent SureSelect (Human All Exon 50 Mb) and SureSelectXT Exome Capture kits.Paired-end reads of either 100 bp or 125 bp in size were sequenced on an Illumina HiSeq 2500 sequencer with V1 or V4 kits.Two libraries from each brain sample were sequenced to achieve a coverage of approximately 300X and one library was sequenced for the blood or saliva samples to reach 100X coverage (Supp.Fig. 2).
The data processing procedure to yield CRAM files was identical to the BSMN common processing pipeline described for the NRB data.Raw fastq files were also uploaded to the NDA collection.
Schizophrenia -NDA collection 2967.NDA collections 2967, 2966, and 2963 contain some subjects that are shared between repositories.All samples were obtained postmortem and were de-identified.Shared subject samples were obtained through the LIBD Human Brain and Tissue Repository.Donations to the LIBD were collected through the Office of the Chief Medical Examiner of the State of Maryland under the Maryland Department of Health's IRB protocol #12-24.Audiotaped and witnessed informed consent was obtained from the legal next-of-kin for every case at the time of autopsy.
The LIBD Autopsy telephone screening was done at time of donation with the legal next-of-kin and consisted of 39 items about the donor's medical, social, psychiatric, substance use, and treatment history.Retrospective clinical diagnostic reviews were conducted for every brain donor to include data from: autopsy reports, toxicology testing, forensic investigations, neuropathological examinations, telephone screening, and psychiatric/ substance abuse treatment record reviews and/or supplemental family informant interviews.All data was compiled and summarized in a detailed psychiatric narrative summary and was reviewed independently by two board-certified psychiatrists to determine lifetime psychiatric diagnoses according to DSM-5.Non-psychiatric control donors were excluded if they had acute drug and alcohol intoxication/use at time of death.
Additional samples were obtained from the NICHD Brain and Tissue Bank for Developmental Disorders at the University of Maryland and the University of Virginia Biorepository (NDA collection 2963).Each tissue bank followed their respective procedures for consent, donation, and obtaining and archiving samples.
The LIBD performed a series of matching procedures on case and control brain samples to maximize the likelihood of finding mosaicism in the brain associated with a diagnosis of SCZ by matching on sex, ancestry, and age at death.This selection process included choosing young subjects with an early age of onset, controlling for various confounders that compromise molecular studies of the postmortem human brain, controlling for tissue quality, and excluding individuals with evidence of prior recurrent CNVs associated with SCZ.Table 3 summarizes the dataset generated from the LIBD as NDA collection 2967.
Samples were processed for PrimeFlow RNA sorting (Thermo Fisher).One to 1.5 g of frozen human postmortem brain tissue from the DLPFC was obtained for each sample and split into aliquots of approximately 500 mg.Each 500 mg tissue sample was placed in a chilled dounce homogenizer containing 5 mL of chilled lysis buffer (0.32 M sucrose, 3 mM magnesium acetate, 5 mM calcium chloride, 0.1 mM EDTA, 10 mM Tris HCl, ph 8.0, 0.10% Triton X-100) and thawed for 5 min on ice.Samples were then dounced 50 times each with the tight pestle, and an additional 5 mL of chilled lysis buffer was added to the solution.The tissue lysate was then layered on top of 18 mL chilled sucrose buffer (1.8 M sucrose, 3 mM magnesium acetate, 10 mM Tris HCl pH 8.0) in a 38.5 mL ultracentrifuge tube (Cat.No. 344058, Beckman Coulter) to form a gradient.Samples were ultracentrifuged using a SW32 Ti rotor at 139,800RCF (28,600RPM) for 3hrs at 4 °C.
The PrimeFlow RNA Assay was performed using the assay kit from Thermo Fisher Scientific (Carlsbad, CA) according to the manufacturer's protocol, with some modifications.We optimized the PrimeFlow RNA approach to capture nuclei from specific cell types of interest using RNA labeling followed by FANS.We developed our own computational pipeline and designed and tested custom probes for neuronal nuclei (SNAP25), as well as for specific inhibitory (GAD1) and excitatory (SLC17A7, aka VGLUT1) neuronal subclasses, oligodendrocytes (MBP) and astrocytes (GFAP).Briefly, purified nuclei were fixed and permeabilized in Fixation Buffer 1 on ice for 30 min, followed by fixation in Fixation Buffer 2 at room temperature in the dark for 1 hr.To detect nuclear mRNA targets, fluorophore-tagged target probes were first diluted and hybridized to the targets at 40 °C.Signal amplification was then carried out through the hybridization of PreAmplifier Mix and then Amplifier Mix, each at 40 °C for 1 hr and stained with DAPI (final concentration of 1 µM) for 10 min on ice in the dark.Data were acquired on a MoFlow XDP Cell Sorter (Beckman Coulter, Indianapolis, IN) using appropriate laser lines and filters (Type 1, AlexaFluor 647; Type 4, AlexaFluor 488) and analyzed using FlowJo software (TreeStar, Inc., Ashland, OR).After nuclei sorting and collection, DNA and RNA were extracted using the AllPrep DNA/ RNA FFPE kit (Qiagen, Germantown, MD) following the manufacturer's protocol.
DNA libraries were prepared using either an Illumina TruSeq PCR Free kit or a Nano DNA kit based on multiple QC metrics.Each library was sequenced at 30X or 90X coverage at Psomagen using a HiSeq X or NovaSeq.6000 S4 flowcell with 150 bp paired-end reads (Supp.Fig. 1).DNA input for library preparation was 300 ng for bulk WGS and 30 ng DNA from Prime Flow RNA Assay WGS.
All WGS data were uniformly processed through the BSMN uniform processing pipeline to prepare aligned BAM files.17]34 .The merged BAM files were marked for duplicate reads using PICARD (v2.12.1).Indel realignment and base quality recalibration was performed with GATK (v3.7-0), resulting in the final uniformed processed BAM files 18 .To call structural variants (SVs) based on these BAM files, Parliament2 was run on the DNAnexus platform, a suite that includes a combination of SV callers (Breakdancer, Breakseq.2, CNVnator, Delly2, Manta, and Lumpy), SVTyper was used in genotyping, and Survivor was used to merge the higher specificity consensus calls [35][36][37] .
A total of 89 samples were used in RNA sequencing assays in two batches, with preprocessing and quality assessments from 45 unique individuals' DLPFC and hippocampus tissues.One hippocampus sample failed.Total RNA was extracted from brain tissue using the RNeasy kit (Qiagen).Paired-end 100 bp reads were generated with a targeted coverage of 80-100 million sequencing reads per sample from libraries prepared with Illumina TruSeq stranded Ribo-zero Gold kit.Our preprocessing pipeline includes quantified QC metrics calculated with FastQC (v.0.11.5), alignment with HISAT2 (v.2.0.4), counting reads for genomic features using featureCounts (subread) (v.1.5.0-p3), and quantification of ERCC synthetic transcript spike-ins using kallisto (v.0.43.0) [38][39][40] .Following completion of the preprocessing pipeline, the following in-depth QC checks on samples that failed one or more of the following tests were performed: alignment rate lower than 80%, low genotype correlation between samples labeled from the same brain, or ambiguous brain region identity score based on the expression of top genes that differentiate the two regions.All remaining samples passed additional technical QC checks for mitochondrial mapping rates, ribosomal RNA mapping rates, adapter content, and expected ERCC spike-in concentrations.

Schizophrenia -NDA collection 2966. Pulverized frozen DLPFC and hippocampal (HIPPO) brain tissue
samples from 20 deceased individuals (n = 10 SCZ-affected individuals, n = 10 unaffected controls marked as N) were obtained from the LIBD.Note that all of these subjects were also present in NDA collection 2967 and some were shared with collection 2963 but underwent different sequencing workups.Each of the SCZ and control samples were matched to one another based on gender, age, and ancestry, resulting in 10 matched pairs.Dural fibroblasts (FIBRO) were received for 16 of the 20 subjects (8 matched pairs).Dural fibroblasts were cultured for 3-10 weeks and passaged when they reached ~85-95% confluence using the following growth medium: DMEM supplemented with 10% fetal bovine serum (FBS), 2% Glutamax, and a 1% Antibiotic, Antimycotic solution, which were all obtained from Gibco/Thermo Fisher (Waltham, MA).
Whole exome sequencing was performed on each of the above DNA samples as described in the section for the neurotypical reference brain.Briefly, duplicate genomic DNA libraries were prepared in the following manner: (1) genomic DNA (75-200 ng) from each of the above samples was sheared to ~350 bp on a Covaris M220 Focused-ultrasonicator (Covaris, Woburn, MA) using microTUBE-50 and manufacturer's recommended settings; (2) sheared DNA was purified with 1X SPRIselect beads (Beckman Coulter, Pasadena, CA).Then, the NEBNext Ultra DNA Library Prep Kit for Illumina (New England Biolabs, Ipswich, MA) was used for end repair, dA-tailing, and to ligate Nextflex adapters (Perkin Elmer, Waltham, MA) onto the sheared genomic DNA.After ligation, reactions were purified with 0.65X SPRIselect beads (Beckman Coulter) and PCR enrichment of adapter-ligated DNA was performed for 10-14 cycles using NEBNext Ultra DNA Library Prep Kit (New England Biolabs); and (3) libraries were purified with 0.65X SPRIselect beads (Beckman Coulter) and quantified using a Qubit dsDNA HS Assay Kit (Thermo Fisher Scientific, Carlsbad, CA).A 50 ng aliquot of the library was saved as a qPCR control to assess capture efficiency after exome target enrichment.The remaining 400-800 ng was used for exome target enrichment experiments.
Exome target enrichment was performed using SeqCap EZ Exome Probes v3.0 (Roche Sequencing Solutions, Pleasanton, CA) according to the manufacturer's protocols.We used a 72-hr hybridization incubation period to capture the DNA and 12-16 cycles of post-capture ligation-mediated PCR to amplify the exome enriched DNA.A Qubit dsDNA HS Assay Kit was used to quantify the captured DNA.The exome target enrichment was calculated by determining the abundance of the exome targets in the post-capture library relative to the abundance of the exome targets in the pre-capture library as specified in the protocols in SEqCap_EZ_UGuide_v5.4.Each individual library then was subjected to quality control verification and sequenced on a single lane of a HiSeq X series sequencer at Novogene Corporation (Davis, CA).Coverage is shown in Supp.Fig. 2.
Linked read sequencing was also performed on the samples.An aliquot of gDNA (1-5 ug) from the DLPFC samples from the 10 neurotypical/SCZ matched pairs was used to generate 10X Genomics (Pleasanton, CA) linked read sequencing data.The samples were sequenced on the Illumina NovaSeq.6000 platform at HudsonAlpha Genome Sequencing Center (Huntsville, AL).The resultant reads were aligned to the GRCh37 reference (refdata-b37-2.1.0)using Long Ranger v2.2 and were used to identify single nucleotide polymorphisms (SNPs), phase heterozygous variants, and assign aligned sequencing reads to their haplotype of origin.

Schizophrenia -NDA collection 2963.
Frozen pulverized brain tissues isolated from DLPFC and HIPPO of 21 individuals (11 SCZ and 10 age-matched control) were obtained from the LIBD.Note that although these subjects are shared with collections 2967 and 2966, the majority of these specimens underwent different sequencing workups from the related NDA repositories.The metadata of the shared individuals will be found in the related collections.Samples from six other subjects (five neurotypical and one with unknown condition) not shared with the other collections were also obtained, and underwent single cell sequencing.Nuclei isolation medium was added to tissue samples, which were then homogenized with a dounce homogenizer on ice.Purified nuclei were then incubated with PBS containing 5% BSA and 5 μg/mL Alexa Fluor 488 conjugated anti-NeuN antibody (Millipore Sigma) at 4 °C for 1 h.Nuclei were then stained for 10 μg/mL DAPI and sorted by FACS.Single nuclei from the NeuN− and DAPI-positive population were sorted into 384-well plates containing 1.5 μL of lysis buffer (0.2 M KOH, 0.05 M DTT), alongside two water controls.
Single nuclei underwent whole genome amplification using the PicoPLEX WGA kit, where they were incubated on ice for 10 min, incubated at 65 °C for 10 min, and then cooled to 4 °C, after which 9 μl of sample buffer, 9 μl of reaction buffer, and 1 μl phi29 enzyme (GenomiPhi HY, GE Healthcare) were added to each well.Reactions were incubated at 30 °C for 8 h and then inactivated at 65 °C for 10 min.MDA products were examined for sufficient amplification using a subset of the 47 single copy loci used in Hosono et al. 2003 as quality control markers, as they are distributed throughout the genome 41 .MDA samples that passed quality control metrics were subjected to somatic LINE-1-associated variant sequencing (SLAV-seq) library preparation 13 .
As described in Erwin et al. 2016, for preparation of SLAV-seq libraries, 10 μg of MDA product was sheared to approximately 500 bp.The sheared DNA was concentrated with Ampure beads (Beckman Coulter) before undergoing capture, annealing and extension with platinum Pfx DNA polymerase (Invitrogen; 94 °C 5 min, 61.5 °C for 30 s, 68 °C for 3 min) and sets of primers relevant to each step.Capture of LINE-1 sequences proceeded using the 5′ biotinylated oligo of sequence ATATACCTAATGCTAGATGACAC*A, with phosphorothioate linkages at the asterisk mark.For annealing, oligos included 5 ′-AA TG AT AC GG CG AC CA CC GA GA TC TA CA CN N Water control MDA products were included in the SLAV-seq procedure as negative controls.The preamplified material was handled in a separate single cell room in a laminar flow hood, and materials used in preamplification, excluding enzymes, sample and reaction buffers, were UV sterilized before use.After purification with Ampure beads, quantification with picogreen and qPCR, then adding 10-20% phiX, the samples underwent paired-end sequencing with either an Illumina HiSeq 2500 or Illumina NovaSeq platform at the Salk next-generation sequencing core 13 .

Schizophrenia -NDA collection 2965.
NeuN+ nuclei isolation was performed on DLPFC tissue samples using fluorescence-activated nuclei sorting.The Illumina PCR-free TruSeq DNA Library Prep kit was used according to the manufacturer's protocol to produce the DNA libraries of the sorted samples.Constructed libraries were quantified using the KAPA Library Quantification Kit with real-time PCR and were then sequenced on the Illumina HiSeq X Ten platform to yield 150 bp paired-end reads for all samples.Whole genome sequencing of these samples aimed for 200X coverage per sample (Supp.Fig. 1).
From the final BAM files, single nucleotide variants and small indels were called using GATK (v3.7-0)HaplotypeCaller with-ploidy option 2, 12, or 50 resulting in three raw call sets per sample 19 .The raw call sets were filtered in the following sequential steps: (1) only PASS calls were retained; (2) known germline variants were excluded based on the 1000 Genomes, ExAC, GnomAD, ESP5600, and Kaviar databases 21,22,42,43 ; (3) calls located in problematic genomic regions defined as non-P bases of the 1000 Genomes were removed; (4) a binomial test was used to remove likely heterozygous variants: for each variant the p and N parameter of the binomial null distribution was set to 0.5 and the total read count (read depth) at the variant position, respectively, and we removed the variant supporting the null hypothesis at α = 0.00001; (5) a similar binomial hypothesis test was used to filter variant calls at sites with strand bias; (6) similarly, Fisher's exact test was used to filter out calls with imbalance of strand ratios between reference (REF) and alternative (ALT) bases; (7) multi-allelic calls were removed; and (8) calls at sites with >2.5 copy number, estimated using CNVnator, were also removed 44 .
The implementation of the pipeline used for the present work was designed to run on AWS EC-2 compute instances managed by the AWS ParallelCluster software.The source code is available at https://github.com/bsmn/bsmn-pipeline.

Schizophrenia -NDA collection 2964.
The BP cohort from SMRI included samples from individuals diagnosed with SCZ (n = 35).Details are provided in the previously described BP section.
Tourette syndrome -NDA collection 2961.The analyzed samples and generated data were de-identified and derived from individuals postmortem.
Brains with Tourette Syndrome sequenced by Yale University were obtained from the Harvard's Brain and Tissue Resource Center (HBTRC).Normal brains sequenced by Yale University were obtained from NIH NeuroBioBank.The work with those brains was handled in multiple institutes under multiple IRBs.These included the University of Miami Brain Endowment Bank (IRB no.19920358 (CR0001775)), the University of Maryland Brain and Tissue Bank (IRB no.HM-HP-00042077 and IRB no.5-58), the Harvard Brain Tissue Resource Center (IRB no.2015P002028), The Human Brain and Spinal Fluid Resource Center (IRB identification via PCC# 2015-060672 and VA Project# 0002), the Mount Sinai Brain Bank (IRB no.HAR-13-059), and the Brain Tissue Donation Program at the University of Pittsburgh (IRB no.REN14120157/IRB 981146).
For normal brains obtained through autopsy authorization of Yale University, consent to utilize tissues for research and consent to publish was provided by the patient's next-of-kin according to Connecticut state law and approved protocols of the Yale University BioBank.The research study was approved by the Yale Alzheimer Disease Research Center (ADRC) and was reviewed and deemed exempt by the Yale University Institutional Review Board.The BioBank protocols are in accordance with the ethical standards of Yale University.
Handling of brains originating from and sequenced by the Lieber Institute was conducted under the approved WCG IRB protocol 20111080 titled "Collection of Postmortem Human Brain, Blood and Scalp Samples for Neuropsychiatric Research".
Nuclei were sorted as fractions into lysis buffer from the DNeasy Blood and Tissue Kit (Qiagen) by FANS (BD FACSAria ™ II).DNA was extracted using the DNeasy Blood and Tissue Kit, which included an RNase A treatment step according to the manufacturer's recommendations.DNA was eluted in 100 µl of elution buffer and the DNA concentration was measured by Qubit (Thermo Fisher).Library preparation (240 fractions: TruSeq PCR-free 450 bp, 500 ng DNA input, 32 fractions: TruSeq Nano, 450 bp, 200 ng DNA input) and WGS was performed by the New York Genome Center (NYGC) and sequenced using the Illumina HiSeq X platform to produce 150 bp paired-end reads at about 30X coverage (Supp.Fig. 1).
Bulk samples of control and TS-affected brain tissue also underwent whole genome sequencing.A mass of 0.1 g of bulk tissue was incubated at 56 °C overnight in lysis buffer from the DNeasy Blood and Tissue kit and DNA was extracted according to the manufacturer's recommendations, which included RNase A treatment.DNA was suspended in 200 µl of elution buffer and the DNA concentration was measured by Qubit (Thermo Fisher).Library preparation and WGS were performed by either Macrogen or NYGC.Both sequencing services used Illumina TruSeq DNA PCR-free library kits resulting in paired-end 150 bp reads sequenced to 100X coverage (Supp.Fig. 1).The Macrogen site used 1 µg DNA input, and NYGC used 800 ng of input.Both used Illumina platforms, and NYGC used an Illumina HiSeq X sequencer.

Data records
The ten BSMN data repositories are accessible in the NIMH Data Archive (NDA).Collection details are listed in Table 4.Additional metadata and a copy of these access links can also be accessed through the Synapse portal at https://bsmn.synapse.org/Explore/Data.Sample details can be found in Supplementary Tables 1-3.
After obtaining NDA clearance, the data of these collections can be accessed through the primary NDA website as packages, which contain both metadata tables of each collection and the locations of the data files in Amazon Web Services S3 Object Storage.Instructions for obtaining clearance and procuring the data can be    Once a data package is downloaded, the set of four key metadata tables per collection can also be reviewed.These files are described in Table 5 and connect subject Globally Unique Identifiers (GUIDs) with relevant de-identified donor, tissue sample, sequencing assay, and related information.These files also contain the S3 links of a variety of data, including WGS, WES, RNA-seq, linked read, single cell WGS-seq, and SLAV-seq of bulk tissue or fractionated samples.Assays used for each condition, along with the number of donors and NDA collection IDs, are shown in Table 6.Note that the use of these assays is not universal for all conditions, donors, and tissue samples, and the metadata tables provide information needed to determine what S3 links are most relevant.The formatting of these tables is standardized with NDA policy.
File and data types are described in the GENOMICS_SAMPLE03 metadata file of each NDA collection.Files within the repository include sequence data (FASTQs), alignment files (BAM/CRAM), the index files (BAI/ CRAI/TBI), alignment summary text files, and files for variant calls (VCFs).The availability of these file types is not universal across repositories, however.See Supp.Table 3 for available file types per sample within each collection.
In addition to NDA collections, which are repositories specific to an entire project, cross-collection subsets of data can be established using the NDA's Study feature.NDA Studies 967 and 814 will be particularly useful to Table 6.This table describes the number of donors, collections, and assays associated with each condition.Assays include whole genome (WGS), whole exome (WES), RNA-seq (RNA), single cell (SC), linked read (LR), and customized (CUS) sequencing methods.Custom sequencing includes SLAV-seq of C2963.Two neurotypical, thoroughly sequenced subjects that can be used as references are present in collections 2458 and 2698.Neurotypical controls are marked as "No Condition, Not Reference." The diagnostic category Other contains n = 4 Unknown phenotype (one in C2963, three in C2964), n = 8 major depressive disorder, n = 25 recurrent unipolar depression (RUDD), n = 30 other mental health diagnosis that is not BP, but not otherwise specified, and n = 1 CADASIL for collection 2964.researchers interested in the BSMN data.Study 967 contains all data of the BSMN 46 .Study 814 contains data that was uniformly processed using an identical alignment pipeline across the entire consortium 47 .This workflow starts with separating fastq files by flow cell lane, with the individual lanes then being aligned to GRCh37d5 with bwa v3.7.16a to obtain read group identifiers by lane.The alignment files were sorted by read group then merged by sambamba v0.6.7 before PICARD v2.12.1 was used to mark PCR duplicates.Following GATK best practices (v3.7), indel realignment and base quality recalibration were then performed to yield the final alignment files.The BSMN Pipeline applied to the Study 814 data is available in Github Repository https://github.com/bsmn/bsmn-pipeline (Table 7) 7,8 .

Technical Validation
Data quality and purity were assessed in a few different ways.Germline variants were called for every WES and WGS sample that underwent uniform processing (Supp.Tables 1, 2), which were used to determine ethnic background and quality of the data.(Note that data that was not included in the uniform processing pipeline, such as the NRGR and SMRI datasets of collection 2964 or the SLAV-seq data targeting LINE-1s of 2963, are still available through the repository ID but are not represented in the following described figures.See Supp.Table 3 for all sample IDs, regardless of uniform processing status.)The counts of these calls, which included heterozygous and Fig. 3 Variant allele frequencies (VAFs) for heterozygous SNPs were obtained for each sample using GATK HaplotypeCaller with ploidy 2. Two example distributions are shown here for samples LIBD01 and LIBD104 (collection 2967).The distribution is shifted from 50% and does not appear normally distributed for LIBD104 suggesting contamination of the sequenced sample with unrelated DNA.homozygous SNPs and indels, were compared across the data, and were found to be relatively consistent (Figs. 1, 2).Overall, two groups were observed with higher and lower counts, with consistent germline variation matching expectations for populations of African and Caucasian/Asian descent 22 .More consistency was observed for SNPs, likely reflected in more bias in capturing and calling indels, which has been observed by other studies 21 .
Some outliers of SNP counts prompted further investigation, and are marked in Fig. 2.These included samples CMC_MSSM_164 from the Mount Sinai School of Medicine (MSSM; NDA ID 2965), and Br5459 of LIBD (NDA ID 2967).Variant allele frequency (VAF) distributions were collected for the WGS data (Fig. 3; Supp.Fig. 3).Heterozygous SNP output from GATK HaplotypeCaller with ploidy 2 were plotted, with the expectation of the peak distribution to sit at a VAF of 50%.Deviation may hint at contamination from an alternative source, shown with two examples in Fig. 3, where one of the two distributions indicates a deviation.The outliers of the SNP and indel counts marked in Fig. 2 show similar deviations.We marked other potentially contaminated WGS samples with a peak of VAF distribution deviating from 50% by more than 2% in Supp. Figure 3 with red sample titles.Mapping quality and percentages are listed in Supp.Tables 1, 2. If the samples were potentially contaminated (via the VAF distribution deviation), they were also marked in Supp.Table 1.To check for variation in data quality, we also considered the fraction of indels to SNPs.The fraction of indels seen were consistent with recent studies 21 .The ratios of heterozygous and homozygous indel to SNP counts per sample are shown in Fig. 4. Overall, these ratios were consistent within the datasets.Variations were likely reflecting some variation in DNA extraction, library preparation, or sequencing.However, the differences were not affiliated with evidence of contamination.
Postmortem intervals (PMIs) were also reported for two repositories (2960 and 2961; Fig. 5a).Median PMIs for these datasets were approximately 20hrs.RIN values were also collected for the RNA-seq data of repository 2960 (Fig. 5b).Median RINs were found to be approximately 5.0.
In studies where somatic mosaic variants were reported, validation was typically performed using targeted amplicon sequencing.PCR primers were designed in silico in the range of 200 to 500 bp outside of the target.PCR-amplified targets were isolated using either agarose gel electrophoresis or a manufactured kit, and the isolated sample was re-sequenced on either a MiSeq platform or an Ion Torrent Personal Genome Machine.Control tissues were used for comparison to validated somatic variant-containing samples.Targets were generally selected based on sample availability, ability to use PCR primers at the desired location, functional significance, and range of loci 5,8 .
Additional methods of somatic mosaic variant validation included targeting multiple somatic calls at once with multiplex PCR target amplification using the CleanPlex Custom NGS Panel (Paragon Genomics) for resequencing on a MiSeq platform 8 .Additionally, digital droplet PCR (ddPCR) was applied to selected calls for validation and variant allele frequencies 8,11 .In cases of somatic mosaic retrotransposon calls, nested PCR and Sanger sequencing of the calls were also applied for further validation of the insertions 11 .Other technical validation information is detailed in Erwin et al. and Breuss et al. 13,14 .

Usage Notes
The BSMN data provide a rich source of data for further studies of mosaic variation, including multiple platforms (e.g.WES, WGS), sequencing depths, and genomic DNA sources (Table 8).NDA collection 2458 contains the WGS, WES, linked read, and single cell data of neurotypical reference brain LIBD 5154 for bulk brain tissue, NeuN+/NeuN− fractions, and dural fibroblasts.Additionally, NDA collection 2968 contains somatic mosaicism exploration of multiple samples of a mature neurotypical brain with WGS sequencing of bulk and fractionated tissue, as well as MDA and single cell MDA data of certain fractions and tissue types.The somatic single nucleotide variants were validated by PCR target amplification.The reference brain of 2458 and the mature individual in 2968 could be used either as controls for other studies or as data for the development of sensitive bioinformatics tools.
The BSMN findings and methods can be applied to the study of other neuropsychiatric conditions or any other condition in which mosaicism is implicated such as cancer, or to understand the nature of both somatic mosaicism and germline variation in the genetic diversity of individuals.The data can be used to explore the complex genetic spectrum of these conditions in conjunction with other data as well, such as PsychENCODE and the CommonMind consortia 48,49 .

Fig. 1
Fig. 1 Heterozygous and homozygous (a) SNP and (b) indel counts in targeted exonic regions.Data represents uniformly processed WES data of cohorts from the University of San Diego (NDA ID 2968), University of Michigan (NDA ID 2966), and Kennedy Krieger Institute (KKI; NDA ID 2964).Phenotype for each cohort is marked at the bottom of the figure in a rugchart.

Fig. 2
Fig. 2 Heterozygous and homozygous (a) SNP and (b) indel counts in the uniformly processed WGS data in the cohorts from Mount Sinai School of Medicine (MSSM; NDA ID 2965), University of San Diego (UCSD; NDA ID 2968), Yale University (NDA ID 2961), Harvard University (NDA ID 2962), and the Lieber Institute for Brain Development (LIBD; NDA ID 2967).Phenotype is marked at the bottom in a rugchart.Some samples with possible contamination are labeled.

Fig. 4
Fig.4 Ratios of heterozygous and homozygous indels to SNPs across all cohorts for all uniformly processed WGS data.

Fig. 5
Fig. 5 Quality of RNA-seq data.(a) Postmortem interval (PMI) of affected (n = 38) and control (n = 42) individuals in the two Yale cohorts (collections 2960 and 2961), which were the only cohorts with such a data type.Differences were not statistically significant (p > 0.05).(b) RIN values of ASD-affected (n = 63 ASD, n = 23 suspected ASD samples) and control (n = 73 neurotypical control samples) RNA-seq samples from Yale cohort 2960.

Autism spectrum disorder -NDA collection 2962.
NDA collection 2962 utilized de-identified postmortem human brain tissues that were collected by the University of Maryland Brain and Tissue Bank (an NIH NeuroBioBank; institutional review board (IRB) no.HP-00042077), the Autism BrainNet (McLean IRB nos.2006-P-0001161 and 2006-P-0001862; Western IRB no.20141029), and the LIBD (Maryland Dept of Health

Table 4 .
The 10 BSMN NDA repositories are listed.The table includes the collection ID, name, URL, number of donors, approximate repository size, and relevant condition under study for each collection.Note that some donors are shared between some repositories, and some subject IDs may represent experiments, not individuals.For example, collection 2458 lists 13 donors, but some subjects are mixed-DNA experiments, as described in the methods section.Collections 2967, 2966, and 2963 overlap with some shared subjects that underwent different types of sequencing.1087 unique subject IDs are listed across all BSMN collections.

Table 5 .
The list of metadata tables available for each NDA collection after deploying a data package for AWS Oracle access.foundat https://nda.nih.gov/nda/webinars-and-tutorials.html.The Author Contributions section describes PIs and institutions associated with each NDA collection.